Outline


  1. What and why Markdown
  2. Markdown basics
  3. What and why tidyverse
  4. tibble, pipes!!, dplyr, tidyr, readr

What is Markdown and why use it?

R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both:

  • save and execute code
  • generate high quality reports that can be shared with an audience

R Markdown documents are fully reproducible and support dozens of static and dynamic output formats.

Code and comments sit together in a more readable format. It works well with version control repositories (like git), allows you to test code in chunks (clean environment!), and is just a whole lot more convenient than going back and forth between R and Word (especially when someone critiques your figures!).

You can knit to html, PDF, or Word (probably others, too).

Some examples of things in Markdown:

You can easily add images and gifs!!


Markdown Basics

Code is organized into “chunks”

Here’s a basic chunk:
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
qqplot(iris$Petal.Length,iris$Petal.Width)


Here’s a chunk with some options applied (the code is hidden, so only the results appear):
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 


These are the most common chunk options…but there are tons more.

(This is a good time to learn about tab complete!)

  • include = FALSE Runs the chunk, but doesn’t show results or code.
  • echo = FALSE Shows the results, but not the code.
  • message = FALSE Suppresses any messages.
  • warning = FALSE Suppresses warnings.
  • fig.cap = "blah blah blah" Easily add a caption to your figures.

You can also set global options (they apply to all chunks) using knitr
e.g. knitr::opts_chunk$set(echo=TRUE)
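
As a rough sketch of how global options are typically set, the snippet below could go in a setup chunk near the top of the document. The particular values are only an illustration, not the settings used for this document.

```r
# Set defaults for every chunk that follows (values chosen only for illustration)
knitr::opts_chunk$set(
  echo = TRUE,       # show the code
  message = FALSE,   # hide messages, e.g. from loading packages
  warning = FALSE    # hide warnings
)

# Check what is currently set
current_echo = knitr::opts_chunk$get("echo")
current_echo
```

Individual chunks can still override these defaults in their own chunk options.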


You can use several programming languages in chunks: bash, perl, python, R, etc.

ls *.gif
## cats.gif

You can make tables easily too (using a package like knitr or pander)

library(pander)
pander(head(iris), caption = "A table made with pander")

A table made with pander

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3            1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5             3.6          1.4           0.2          setosa
5.4           3.9          1.7           0.4          setosa

Oh yeah, you can obviously make block quotes or text that is bold or italic, etc. Check out the R Markdown cheatsheet.

tidyverse!


This run-through was made using lots of notes from Michael Levy and Ryan Peek!


What is it?

tidyverse is really just a collection of packages. There are core packages, which you’ll likely use, and then other more specific packages.

install.packages("tidyverse") will install all of the ~20 packages

library(tidyverse) attaches only the core packages, including:

  • dplyr
  • tidyr
  • ggplot2
  • readr
  • purrr
  • tibble

Follows a consistent philosophy – Tidy data

Data goes into data frames

  • Each type of observation gets a data frame
  • Each variable gets a column
  • Each observation gets a row
Note: This idea should already be really familiar if you use ggplot2 for making figures…it’s a tidyverse package!!

Practical advantages

  • Incremental steps
  • Left to right operations with pipe!
  • Human readable
  • Consistency
    • Many functions take a data frame as their first argument, which is what makes piping work
    • Easier to read, faster to write
  • The defaults aren’t dumb (e.g. readr::write_csv() leaves out row names by default, while utils::write.csv() needs row.names = FALSE)

tibble

A less clunky version of data frames (also easier to build dummy data when you need help!)

tdf = tibble(x = 1:1e4,
             y = rnorm(1e4)) # older code used data_frame(), which is now deprecated
tdf

There’s much more with tibbles! Check out the manual!!
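
As a small sketch of the "dummy data" point, tribble() from the tibble package lets you type a little table row by row, and tibble() lets a column refer to one defined earlier in the same call. The column names and values here are made up for illustration.

```r
library(tibble)

# tribble() builds a small table row by row (made-up values)
dummy = tribble(
  ~site, ~treat,    ~count,
  "a",   "control", 3,
  "a",   "water",   5,
  "b",   "control", 2
)
dummy

# tibble() can use a column defined earlier in the same call
tdf2 = tibble(x = 1:5, y = x^2)
tdf2
```

This is handy when you need a tiny reproducible example to ask for help.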


Pipes %>%

In RStudio on a Mac, type Shift + Cmd + M

On a PC, type Shift + Ctrl + M

A pipe just sends the output of the left-hand side to the first argument of the function on the right-hand side.

With pipes:

sum(1:10) %>% 
  sqrt()
## [1] 7.416198

Without pipes:

sqrt(sum(1:10))
## [1] 7.416198

or

x = sum(1:10)
sqrt(x)
## [1] 7.416198
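
To make the "first argument" rule concrete, here is a short sketch showing that x %>% f(y) is the same as f(x, y). The vector x and the use of round() are arbitrary choices for the example.

```r
library(magrittr)  # home of %>%; dplyr re-exports it

x = c(2.345, 7.891)

# x %>% f(y) is just another way of writing f(x, y)
piped = x %>% round(1)
nested = round(x, 1)
identical(piped, nested)

# Longer chains read left to right instead of inside out
chained = 1:10 %>% sum() %>% sqrt() %>% round(2)
chained
```

The payoff grows with the number of steps: each new step is added at the end of the chain rather than wrapped around everything before it.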

dplyr

dplyr is used to manipulate data (in data frames…).

Five core functions:

  • filter
  • select
  • arrange
  • group_by
  • summarise

…there’s a bunch more, too.


I got some pigeon racing data from the internet. It’s actually a mess, so to fix some of it real quick, let’s select only the few variables we want.

pg = read_csv("pigeon-racing.csv")

Let’s use names() to quickly see the names of the columns

names(pg)
##  [1] "Pos"      "Breeder"  "Pigeon"   "Name"     "Color"    "Sex"     
##  [7] "Ent"      "Arrival"  "Speed"    "To Win"   "Eligible"
pigeon = select(pg, Breeder, Pigeon, Color, Sex, Speed)
pigeon

Let’s look at only fast female pigeons with filter

filter(pigeon, Speed > 150, Sex == "H")

With base R that’s accomplished with…

pigeon[pigeon$Speed > 150 & pigeon$Sex == "H", ]

Note that the dplyr version is less verbose, and doesn’t require remembering which side of the comma you’re on. Adding additional steps will also be simpler with the tidy format.
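
One practical difference worth knowing, sketched here with a made-up three-row data frame (not the pigeon data): filter() drops rows where the condition is NA, while base R logical indexing on a data frame returns a row of NAs for them.

```r
library(dplyr)

# Made-up values for illustration only
toy = data.frame(Speed = c(160, NA, 140), Sex = c("H", "H", "C"))

with_dplyr = filter(toy, Speed > 150)
with_base = toy[toy$Speed > 150, ]

nrow(with_dplyr)  # 1: the row with the missing Speed is dropped
nrow(with_base)   # 2: the NA comparison yields an extra all-NA row
```

With base R you would typically wrap the condition in which() or add !is.na(Speed) to get the same result.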

Let’s look at only female pigeons, and then see which breeder had the fastest pigeons. We can do this by adding group_by and summarise. And we’re going to use pipes!

pigeon %>% 
  filter(Sex == "H") %>% 
  group_by(Breeder) %>% 
  summarise(breed.speed = mean(Speed), entries = n()) %>% 
  arrange(desc(breed.speed))
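
If you don’t have the pigeon file handy, the same filter, group_by, summarise, arrange pattern can be tried on the built-in iris data. This is only a sketch; the cutoff of 1.5 and the column name mean.length are arbitrary choices made for the example.

```r
library(dplyr)

by_species = iris %>% 
  filter(Petal.Length > 1.5) %>%   # keep only rows above an arbitrary cutoff
  group_by(Species) %>%            # one group per species
  summarise(mean.length = mean(Sepal.Length), entries = n()) %>% 
  arrange(desc(mean.length))       # largest mean first

by_species
```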

And, of course, if we wanted, we could have done the initial select all within a single pipeline. Plus it works nicely with ggplot2.

pg %>% 
  select(Breeder, Pigeon, Color, Sex, Speed) %>% 
  filter(Sex == "H") %>% 
  group_by(Breeder) %>% 
  summarise(breed.speed = mean(Speed), entries = n()) %>% 
  arrange(desc(breed.speed)) %>% 
  ggplot(aes(x=entries,y=breed.speed,color=Breeder)) +
  geom_point() + 
  labs(x = "Number of Pigeons a Breeder Entered", y = "Mean Speed of a Breeder's Pigeons") +
  theme_classic() + theme(legend.position = "none")


tidyr

tidyr can gather to make wide tables long, and spread to make long tables wide. You’re most likely to use gather.

religion = read_csv("religion.csv")
## Parsed with column specification:
## cols(
##   religion = col_character(),
##   less.than.30k = col_double(),
##   more30less50 = col_double(),
##   more50less100 = col_double(),
##   more100 = col_double(),
##   sampleSize = col_number()
## )
religion

There are really three variables: religion, income, and frequency (sample size, too)

religion %>% 
  gather(income, frequency, -religion, -sampleSize) %>% 
  arrange(religion)
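
In tidyr 1.0.0 and later, gather and spread still work but have been superseded by pivot_longer and pivot_wider. Here is a minimal sketch of the same reshaping done both ways, using a made-up two-row table (placeholder religions "A" and "B", invented counts) rather than religion.csv:

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the religion table (values are invented)
toy = tibble(
  religion = c("A", "B"),
  less.than.30k = c(10, 20),
  more100 = c(5, 15),
  sampleSize = c(15, 35)
)

# The gather() approach used in this document
long_gather = toy %>% 
  gather(income, frequency, -religion, -sampleSize) %>% 
  arrange(religion)

# The same reshaping with pivot_longer()
long_pivot = toy %>% 
  pivot_longer(cols = c(-religion, -sampleSize),
               names_to = "income", values_to = "frequency") %>% 
  arrange(religion)

long_pivot
```

Both versions give one row per religion and income bracket; pivot_longer just names its arguments more explicitly.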

An example with real data we’ve all received before.

tricho = read_csv("trichorainfallpollinators.csv")
## Parsed with column specification:
## cols(
##   num = col_integer(),
##   treat = col_character(),
##   date = col_character(),
##   time = col_time(format = ""),
##   num_flow = col_integer(),
##   num_lgb = col_integer(),
##   num_stripey = col_integer(),
##   num_bomb = col_integer(),
##   num_syrph = col_integer(),
##   num_tinyblackbee = col_integer(),
##   num_other = col_integer()
## )
tricho
tricho %>% 
  gather(species, count, -num, -treat, -date, -time, -num_flow)

Now we can do useful things with it, like find the visit rate

tricho_tidy = tricho %>% 
  gather(species, count, -num, -treat, -date, -time, -num_flow) %>% 
  mutate(observation = paste(num,treat,date)) %>% 
  group_by(observation, num_flow) %>% 
  summarise(visits = sum(count)) %>% 
  mutate(visit.rate = (visits/num_flow)/10)

tricho_tidy
library(plotly)
library(viridis)
p = ggplot(data = tricho_tidy, 
           aes(x = num_flow, y = visit.rate, color = visits, 
               text = paste("Observation:", observation))) + 
  geom_point() + 
  scale_color_viridis() + 
  labs(x = "Number of Flowers", y = "Visits Per Flower Per Minute", 
       color = "Raw Flowers Visited") + 
  theme_classic() + 
  theme(legend.position = "bottom")
ggplotly(p)